9장 비지도학습

주요 내용

참고: 핵심 설명과 코드는 🔑로 표시되었으며 굳이 알아둘 필요가 없는 코드는 ✋로 표시되었다.

기본 설정

9.1 군집화

분류 대 군집화

아래 코드는 분류와 군집화의 차이를 보여주는 그림을 그린다.

꽃잎 길이와 너비만으로는 두 개의 군집으로만 구분이 가능해 보인다. 하지만 꽃잎의 길이와 너비와 더불어 꽃받침의 길이와 너비까지 포함한 네 개의 특성을 모두 사용하여 후반부에서 가우시안 혼합 모델을 이용하여 세 개의 군집으로 나눌 수 있다.

실제 타깃을 이용한 분류와 직접 비교하면 다음과 같으며, 버시컬러와 버지니카 품종의 군집이 거의 정확하게 구분된 것을 확인할 수 있다.

군집화의 정확도를 확인하기 위해 군집별로 가장 많이 포함된 품종, 즉, 품종의 최빈값(mode)을 확인한다.

아래 코드는 사이파이(scipy)의 통계 모듈에 포함되어 있는 mode() 함수를 이용하여 각 군집별 최빈값을 확인한 후에 해당 최빈값과 군집 인덱스를 연결(mapping)한다.

최종 결과는 다음과 같다.

mapping을 이용하여 군집 인덱스를 품종 인덱스로 변경한 후에 군집화의 정확도를 측정하면 96.7% 가 나온다.

참고: 사용하는 사이킷런 버전에 따라 조금씩 다른 결과가 나올 수 있다.

9.1.1 K-평균

먼저 2,000개의 데이터 샘플을 생성한다. 생성되는 데이터는 지정된 5개의 센터를 중심으로 지정된 표준편차를 따르는 원 모양의 데이터 군집을 이룬다. 또한 각각의 군집은 거의 동일한 크기를 갖는다.

make_blobs() 함수가 앞서 설명한 방식으로 데이터를 생성한다.

참고: 각 샘플에 대한 실제 군집 인덱스가 타깃으로 함께 제공되지만 여기서는 사용하지 않는다.

산점도를 그리면 다음과 같다.

군집화 훈련과 예측

사이킷런 KMeans 모델의 fit() 메서드는 각 군집의 중심(센트로이드)을 잡은 다음에 모든 샘플에 대해 가장 가까운 센트로이드를 중심으로 하는 군집을 이루도록 한다. 다만, 센트로이드를 몇 개로 할 것인지 먼저 지정해야 하며, 여기서는 5개를 사용하여 훈련시킨다.

훈련 결과 kmeans.labels_ 속성에 각 훈련 샘플에 대한 군집 인덱스가 저장된다.

참고: 앞서 붓꽃 데이터 군집화에서 살펴본 것처럼 데이터가 생성될 때 지정된 실제 군집 인덱스가 아닌 훈련 과정에서 무작위로 지정된 인덱스임에 주의해야 한다.

각 군집의 중심, 즉 5개의 센트로이드의 좌표는 다음과 같다.

predict() 메서드를 이용하여 새로운 데이터에 대한 군집 예측도 가능하다.

결정경계

군집을 나누는 결정경계를 그리면 보로노이 다이어그램이 생성된다.

plot_data() 함수는 여기서만 사용되는 기본값이 지정된 산점도를 그린다.

plot_centroids() 함수는 센트로이드를 시각화한다.

plot_decision_boundaries() 함수는 결정경계를 시각화한다.

경계 근처의 일부 데이터가 잘못된 군집에 포함되긴 하였지만 전반적으로 군집이 잘 형성되었다.

군집화와 차원축소

지금까지 살펴 보았듯이 K-평균 모델 객체의 labels_ 속성은 각 샘플에 대해 가장 가까운 센트로이드를 중심으로 하는 군집의 (작위적으로 지정된) 인덱스를 저장하며, 이를 이용하여 predict() 메서드는 샘플이 속하는 군집의 인덱스를 반환한다. 이런 방식의 군집화가 하드 군집화(hard clustering)이다.

반면에 소프트 군집화(soft clustering)는 샘플과 각 군집 사이의 관계를 점수로 부여한다. 점수는 예를 들어 각 군집과 샘플사이의 거리 또는 5장에서 소개한 가우시안 방사기저 함수를 이용한 유사도 점수 등이 사용될 수 있다. K-평균 모델 객체의 transform() 메서드는 샘플과 각 센트로이드 사이의 (유클리드) 거리를 점수로 사용한다.

아래 코드는 네 개의 새 데이터를 변환하는 것을 보여준다.

✋ 앞서 설명한 대로 각 샘플에 대한 반환값은 각 센트로이드로부터의 (유클리드) 거리임을 아래와 같이 확인할 수 있다.

K-평균 모델의 transform() 메서드는 결론적으로 기존 $n$-차원의 데이터셋을 비선형적으로 $k$-차원의 데이터셋으로 변환하는 비선형 차원축소 기법으로 사용될 수 있다.

참고: 국소적 선형 임베딩 처럼 차원축소 기법을 군집화에 활용할 수도 있다.

K-평균 알고리즘

K-평균 알고리즘은 가장 빠르며 가장 간단한 군집화 알고리즘 중 하나이다.

사이킷런의 KMeans 클래스

KMeans 클래스의 기본 옵션 중 몇 가지는 다음과 같다.

여기서는 예시를 위해 아래 옵션을 대신 사용한다.

각 단계별 결정경계와 센트로이드의 변화는 다음과 같다.

센트로이드 초기화 문제와 해결법

초기화 문제

초기화가 무작위로 이루어질 경우 적절하지 않은 군집화를 얻을 수 있다. 아래 코드는 두 개의 나쁜 경우를 잘 보여준다.

plot_clusterer_comparison() 함수는 두 개의 결정경계 그래프를 동시에 그려준다.

아래 두 개의 그래프는 바로 이전의 경우처럼 센트로이드 초기화를 무작위로 딱 한 번 사용한 결과를 보여준다. 두 경우 모두 적절치 않는 모델을 생성한다.

관성(inertia, 이너셔)

비지도 학습인 K-평균 모델의 성능을 관성(inertia)을 이용하여 측정한다.

✋ 아래 코드는 관성이 각 훈련 샘플과 가장 가까운 센트로이드 사이의 거리를 모두 합한 값임을 확인해준다.

score() 메서드는 관성의 음숫값을 반환한다. 이유는 "더 높은 점수가 더 좋다"의 원칙을 따라야 하기 때문이다.

초기화 반복

무작위 초기와 문제를 해결하기 위해 K-평균 알고리즘의 초기화를 여러 번 실행한 다음에 가장 낮은 관성을 보이는 모델을 최종 모델로 선택하면 되며, 실제로 이 옵션(n_init=10)이 앞서 설명한 대로 KMeans 모델의 기본 하이퍼파라미터 값으로 설정되어 있다.

아래에서 확인할 수 있듯이 기본 하이퍼파라미터를 사용한 kmeans의 관성이 한 번의 센트로이드 초기화를 사용하는 모델들의 관성보다 낮다.

실제로 n_init=10로 설정할 경우 앞서 살펴본 좋은 모델과 비슷한 결과를 얻는다.

관성 점수도 거의 최저다.

K-Means++ 알고리즘

센트로이드 무작위 초기화 문제의 보다 근본적인 해결책이 아서(David Arthur)와 바실비츠기(Sergei Vassilvitskii)의 2006년 논문에서 제시되었다. 기본 아이디어는 기존 센트로이드들과의 거리가 멀 수록 다음 센트로이드로 선택될 확률이 높아지도록 하는 것이다.

  1. 균등분포를 따르면서 임의로 하나의 센트로이드 $c_1$ 선택 후 $k$ 개의 센트로이드를 지정할 때까지 아래 과정 반복
  2. $c_1, \dots, c_{i-1}$이 이미 선택되었가고 가정.

    1. 아래의 확률로 새로운 센트로이드 $c_i$로 $\mathbf{x}_i$ 선택:

      $$\frac{D(\mathbf{x}_i)^2}{\sum\limits_{j=1}^{m}{D(\mathbf{x}_j)}^2}$$

    2. $D(\mathbf{x}_j)$: $\mathbf{x}_j$와 이미 선택된 $c_1, \dots, c_{i-1}$ 중에서 가장 가까운 센트로이드 사이의 거리

      $$D(\mathbf{x}_j) = \min_{k<i} \| x_j - c_k \|$$

확률 계산으로 인해 초기화 비용이 좀 더 많이 들어가긴 하지만 결과적으로 초기화 횟수(n_init)를 획기적으로 줄일 수 있는 장점이 보다 크다. 따라서 사이킷런의 KMeans 모델의 기본값으로 사용된다.

데이터를 생성할 때 사용된 군집의 실제 중심과 kmeans 모델이 찾아낸 군집의 센트로이드가 매우 비슷함을 아래 코드에서 확인할 수 있다.

주의사항: 좌표의 순서가 다름에 주의하라.

init 하이퍼파라미터의 값으로 센트로이드의 좌표를 지정할 수 있다. 아래 코드는 적절한 센트로이드 좌표 5개를 지정하면 좋은 모델을 얻음을 보여준다.

주의사항: 아래 5개의 좌표가 앞서 언급한 blob_centers의 좌표와 유사하다. 여기서도 순서는 중요하지 않다.

개선된 K-평균 알고리즘과 미니배치 k-평균

개선된 K-평균 알고리즘

The K-Means algorithm can be significantly accelerated by avoiding many unnecessary distance calculations: this is achieved by exploiting the triangle inequality (given three points A, B and C, the distance AC is always such that AC ≤ AB + BC) and by keeping track of lower and upper bounds for distances between instances and centroids (see this 2003 paper by Charles Elkan for more details).

To use Elkan's variant of K-Means, just set algorithm="elkan". Note that it does not support sparse data, so by default, Scikit-Learn uses "elkan" for dense data, and "full" (the regular K-Means algorithm) for sparse data.

There's no big difference in this case, as the dataset is fairly small.

미니배치 K-평균

Scikit-Learn also implements a variant of the K-Means algorithm that supports mini-batches (see this paper):

If the dataset does not fit in memory, the simplest option is to use the memmap class, just like we did for incremental PCA in the previous chapter. First let's load MNIST:

Next, let's write it to a memmap:

If your data is so large that you cannot use memmap, things get more complicated. Let's start by writing a function to load the next batch (in real life, you would load the data from disk):

Now we can train the model by feeding it one batch at a time. We also need to implement multiple initializations and keep the model with the lowest inertia:

Mini-batch K-Means is much faster than regular K-Means:

That's much faster! However, its performance is often lower (higher inertia), and it keeps degrading as k increases. Let's plot the inertia ratio and the training time ratio between Mini-batch K-Means and regular K-Means:

최적의 군집 수

What if the number of clusters was set to a lower or greater value than 5?

Ouch, these two models don't look great. What about their inertias?

No, we cannot simply take the value of $k$ that minimizes the inertia, since it keeps getting lower as we increase $k$. Indeed, the more clusters there are, the closer each instance will be to its closest centroid, and therefore the lower the inertia will be. However, we can plot the inertia as a function of $k$ and analyze the resulting curve:

As you can see, there is an elbow at $k=4$, which means that less clusters than that would be bad, and more clusters would not help much and might cut clusters in half. So $k=4$ is a pretty good choice. Of course in this example it is not perfect since it means that the two blobs in the lower left will be considered as just a single cluster, but it's a pretty good clustering nonetheless.

Another approach is to look at the silhouette score, which is the mean silhouette coefficient over all the instances. An instance's silhouette coefficient is equal to $(b - a)/\max(a, b)$ where $a$ is the mean distance to the other instances in the same cluster (it is the mean intra-cluster distance), and $b$ is the mean nearest-cluster distance, that is the mean distance to the instances of the next closest cluster (defined as the one that minimizes $b$, excluding the instance's own cluster). The silhouette coefficient can vary between -1 and +1: a coefficient close to +1 means that the instance is well inside its own cluster and far from other clusters, while a coefficient close to 0 means that it is close to a cluster boundary, and finally a coefficient close to -1 means that the instance may have been assigned to the wrong cluster.

Let's plot the silhouette score as a function of $k$:

As you can see, this visualization is much richer than the previous one: in particular, although it confirms that $k=4$ is a very good choice, but it also underlines the fact that $k=5$ is quite good as well.

An even more informative visualization is given when you plot every instance's silhouette coefficient, sorted by the cluster they are assigned to and by the value of the coefficient. This is called a silhouette diagram:

As you can see, $k=5$ looks like the best option here, as all clusters are roughly the same size, and they all cross the dashed line, which represents the mean silhouette score.

9.1.2 k-평균의 한계

9.1.3 군집화와 이미지 분할

9.1.4 군집화와 전처리

Let's tackle the digits dataset which is a simple MNIST-like dataset containing 1,797 grayscale 8×8 images representing digits 0 to 9.

Let's split it into a training set and a test set:

Now let's fit a Logistic Regression model and evaluate it on the test set:

Okay, that's our baseline: 96.89% accuracy. Let's see if we can do better by using K-Means as a preprocessing step. We will create a pipeline that will first cluster the training set into 50 clusters and replace the images with their distances to the 50 clusters, then apply a logistic regression model:

How about that? We reduced the error rate by over 28%! But we chose the number of clusters $k$ completely arbitrarily, we can surely do better. Since K-Means is just a preprocessing step in a classification pipeline, finding a good value for $k$ is much simpler than earlier: there's no need to perform silhouette analysis or minimize the inertia, the best value of $k$ is simply the one that results in the best classification performance.

Warning: the following cell may take close to 20 minutes to run, or more depending on your hardware.

Let's see what the best number of clusters is:

9.1.5 군집화와 준지도학습

Another use case for clustering is in semi-supervised learning, when we have plenty of unlabeled instances and very few labeled instances.

Let's look at the performance of a logistic regression model when we only have 50 labeled instances:

It's much less than earlier of course. Let's see how we can do better. First, let's cluster the training set into 50 clusters, then for each cluster let's find the image closest to the centroid. We will call these images the representative images:

Now let's plot these representative images and label them manually:

Now we have a dataset with just 50 labeled instances, but instead of being completely random instances, each of them is a representative image of its cluster. Let's see if the performance is any better:

Wow! We jumped from 83.3% accuracy to 91.3%, although we are still only training the model on 50 instances. Since it's often costly and painful to label instances, especially when it has to be done manually by experts, it's a good idea to make them label representative instances rather than just random instances.

But perhaps we can go one step further: what if we propagated the labels to all the other instances in the same cluster?

We got a tiny little accuracy boost. Better than nothing, but we should probably have propagated the labels only to the instances closest to the centroid, because by propagating to the full cluster, we have certainly included some outliers. Let's only propagate the labels to the 20th percentile closest to the centroid:

A bit better. With just 50 labeled instances (just 5 examples per class on average!), we got 92.7% performance, which is getting closer to the performance of logistic regression on the fully labeled digits dataset (which was 96.9%).

This is because the propagated labels are actually pretty good: their accuracy is close to 96%:

You could now do a few iterations of active learning:

  1. Manually label the instances that the classifier is least sure about, if possible by picking them in distinct clusters.
  2. Train a new model with these additional labels.

9.1.6 DBSCAN

주의사항: 저자의 주피터 노트북에 사용된 아래 코드는 정확하지 않음. 이유는 dbscan.core_sample_indices_가 전체 dbscan.labels_ 가 아닌 경우 엉뚱한 답을 내 놓을 수도 있기 때문임.

y_dist, y_pred_idx = knn.kneighbors(X_new, n_neighbors=1)
y_pred = dbscan.labels_[dbscan.core_sample_indices_][y_pred_idx]
y_pred[y_dist > 0.2] = -1
y_pred.ravel()

9.1.7 기타 군집화 알고리즘

스펙트럴 군집화(Spectral Clustering)

응집 군집화(Agglomerative Clustering)

9.2 가우시안 혼합(Gaussian Mixtures)

Let's train a Gaussian mixture model on the previous dataset:

Let's look at the parameters that the EM algorithm estimated:

Did the algorithm actually converge?

Yes, good. How many iterations did it take?

You can now use the model to predict which cluster each instance belongs to (hard clustering) or the probabilities that it came from each cluster. For this, just use predict() method or the predict_proba() method:

This is a generative model, so you can sample new instances from it (and get their labels):

Notice that they are sampled sequentially from each cluster.

You can also estimate the log of the probability density function (PDF) at any location using the score_samples() method:

Let's check that the PDF integrates to 1 over the whole space. We just take a large square around the clusters, and chop it into a grid of tiny squares, then we compute the approximate probability that the instances will be generated in each tiny square (by multiplying the PDF at one corner of the tiny square by the area of the square), and finally summing all these probabilities). The result is very close to 1:

Now let's plot the resulting decision boundaries (dashed lines) and density contours:

You can impose constraints on the covariance matrices that the algorithm looks for by setting the covariance_type hyperparameter:

9.2.1 가우시안 혼합과 이상치 탐지

Gaussian Mixtures can be used for anomaly detection: instances located in low-density regions can be considered anomalies. You must define what density threshold you want to use. For example, in a manufacturing company that tries to detect defective products, the ratio of defective products is usually well-known. Say it is equal to 4%, then you can set the density threshold to be the value that results in having 4% of the instances located in areas below that threshold density:

9.2.2 모델 선택

We cannot use the inertia or the silhouette score because they both assume that the clusters are spherical. Instead, we can try to find the model that minimizes a theoretical information criterion such as the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC):

${BIC} = {\log(m)p - 2\log({\hat L})}$

${AIC} = 2p - 2\log(\hat L)$

Both BIC and AIC penalize models that have more parameters to learn (e.g., more clusters), and reward models that fit the data well (i.e., models that give a high likelihood to the observed data).

We could compute the BIC manually like this:

There's one weight per cluster, but the sum must be equal to 1, so we have one degree of freedom less, hence the -1. Similarly, the degrees of freedom for an $n \times n$ covariance matrix is not $n^2$, but $1 + 2 + \dots + n = \dfrac{n (n+1)}{2}$.

Let's train Gaussian Mixture models with various values of $k$ and measure their BIC:

Let's search for best combination of values for both the number of clusters and the covariance_type hyperparameter:

9.2.3 베이즈 가우시안 혼합 모델

Rather than manually searching for the optimal number of clusters, it is possible to use instead the BayesianGaussianMixture class which is capable of giving weights equal (or close) to zero to unnecessary clusters. Just set the number of components to a value that you believe is greater than the optimal number of clusters, and the algorithm will eliminate the unnecessary clusters automatically.

The algorithm automatically detected that only 3 components are needed:

Note: the fact that you see only 3 regions in the right plot although there are 4 centroids is not a bug. The weight of the top-right cluster is much larger than the weight of the lower-right cluster, so the probability that any given point in this region belongs to the top right cluster is greater than the probability that it belongs to the lower-right cluster.

Oops, not great... instead of detecting 2 moon-shaped clusters, the algorithm detected 8 ellipsoidal clusters. However, the density plot does not look too bad, so it might be usable for anomaly detection.

가능도(likelihood) 함수

연습문제

1. to 9.

부록 A 참조

10. Cluster the Olivetti Faces Dataset

Exercise: The classic Olivetti faces dataset contains 400 grayscale 64 × 64–pixel images of faces. Each image is flattened to a 1D vector of size 4,096. 40 different people were photographed (10 times each), and the usual task is to train a model that can predict which person is represented in each picture. Load the dataset using the sklearn.datasets.fetch_olivetti_faces() function.

Exercise: Then split it into a training set, a validation set, and a test set (note that the dataset is already scaled between 0 and 1). Since the dataset is quite small, you probably want to use stratified sampling to ensure that there are the same number of images per person in each set.

To speed things up, we'll reduce the data's dimensionality using PCA:

Exercise: Next, cluster the images using K-Means, and ensure that you have a good number of clusters (using one of the techniques discussed in this chapter).

It looks like the best number of clusters is quite high, at 120. You might have expected it to be 40, since there are 40 different people on the pictures. However, the same person may look quite different on different pictures (e.g., with or without glasses, or simply shifted left or right).

The optimal number of clusters is not clear on this inertia diagram, as there is no obvious elbow, so let's stick with k=100.

Exercise: Visualize the clusters: do you see similar faces in each cluster?

About 2 out of 3 clusters are useful: that is, they contain at least 2 pictures, all of the same person. However, the rest of the clusters have either one or more intruders, or they have just a single picture.

Clustering images this way may be too imprecise to be directly useful when training a model (as we will see below), but it can be tremendously useful when labeling images in a new dataset: it will usually make labelling much faster.

11. Using Clustering as Preprocessing for Classification

Exercise: Continuing with the Olivetti faces dataset, train a classifier to predict which person is represented in each picture, and evaluate it on the validation set.

Exercise: Next, use K-Means as a dimensionality reduction tool, and train a classifier on the reduced set.

Yikes! That's not better at all! Let's see if tuning the number of clusters helps.

Exercise: Search for the number of clusters that allows the classifier to get the best performance: what performance can you reach?

We could use a GridSearchCV like we did earlier in this notebook, but since we already have a validation set, we don't need K-fold cross-validation, and we're only exploring a single hyperparameter, so it's simpler to just run a loop manually:

Oh well, even by tuning the number of clusters, we never get beyond 80% accuracy. Looks like the distances to the cluster centroids are not as informative as the original images.

Exercise: What if you append the features from the reduced set to the original features (again, searching for the best number of clusters)?

That's a bit better, but still worse than without the cluster features. The clusters are not useful to directly train a classifier in this case (but they can still help when labelling new training instances).

12. A Gaussian Mixture Model for the Olivetti Faces Dataset

Exercise: Train a Gaussian mixture model on the Olivetti faces dataset. To speed up the algorithm, you should probably reduce the dataset's dimensionality (e.g., use PCA, preserving 99% of the variance).

Exercise: Use the model to generate some new faces (using the sample() method), and visualize them (if you used PCA, you will need to use its inverse_transform() method).

Exercise: Try to modify some images (e.g., rotate, flip, darken) and see if the model can detect the anomalies (i.e., compare the output of the score_samples() method for normal images and for anomalies).

The bad faces are all considered highly unlikely by the Gaussian Mixture model. Compare this to the scores of some training instances:

13. Using Dimensionality Reduction Techniques for Anomaly Detection

Exercise: Some dimensionality reduction techniques can also be used for anomaly detection. For example, take the Olivetti faces dataset and reduce it with PCA, preserving 99% of the variance. Then compute the reconstruction error for each image. Next, take some of the modified images you built in the previous exercise, and look at their reconstruction error: notice how much larger the reconstruction error is. If you plot a reconstructed image, you will see why: it tries to reconstruct a normal face.

We already reduced the dataset using PCA earlier: